Loading and Transforming



It is now time to turn our attention to the workhorse issues of document
warehousing: loading and transforming documents. At this stage, documents are
preprocessed, if necessary, to ensure that they are in a character format and
language appropriate for the tools that will later perform text analysis.
Documents are then indexed for both full text and themes. Depending upon the
needs of document warehouse users, documents may also need to be classified,
grouped with similar documents, and summarized. This chapter will examine
each of these steps in the following order:

■ Internationalization and character set issues
■ Translating documents
■ Indexing texts
■ Classifying documents
■ Clustering documents
■ Summarizing text

We will examine language differences in the first two topics, with an emphasis on
the importance of the Unicode standards and on the uses, and limits, of
machine translation. Since the basics of full text and thematic indexing have
already been discussed in Chapter 4, we will now look into customizing indexing
with specialized thesauri and stoplists. Classification of documents can be done in
two ways, using single labels or using multidimensional taxonomies. Similarly, the
problem of grouping documents into clusters can be approached in at least three
distinct ways, and we will examine those and discuss their appropriate uses as
well. Finally, we will look in detail at different approaches to text summarization.

Figure 8.1 shows the basic steps in the loading and transformation process. First,
preprocessing steps are performed to ensure documents are in a form suitable for
text analysis. Then full text and thematic indexing is done, followed by higher-level
text analysis operations, such as classification, clustering, and summarization.

Figure 8.1 An overview of document loading and transformation.



Internationalization and Character Set Issues



Rapid globalization of economies is bringing new issues right into the middle of
many business operations. The European Community has had to deal with these
issues since its inception, and has found the cost of publishing every official
document in 11 different languages simply astounding. What does this mean to
document warehouse practitioners? Mainly that they will now need to meet
several different types of multilingual needs, including:

■ Business intelligence sources in multiple languages
■ Translations of internal documents, such as procedure manuals and user
  guides
■ Government publications, including regulations and advisory notices
■ Contracts and other interbusiness documents



Business intelligence sources in different languages are essential for a business
with operations in multiple countries. If we depend upon domestic sources for
information about a foreign business climate, then we run the risk of missing
details not covered by the domestic press. There is also the problem of relevant
cultural differences that resident sources may uncover but that are lost in
domestic coverage.

Multinational companies have always had to deal with multiple translations of
documents. The document warehouse does not create additional problems; it
just brings some issues to the forefront. For example, if all documents within
the warehouse should be equally accessible (assuming appropriate access
controls), then the warehouse will need to store, and client applications will
need to render, multiple alphabets.

Managing operations across borders also requires the ability to track and abide
by the laws of each host country. While international agreements, such as the
European Union regulations and the North American Free Trade Agreement,
have eased international business, local regulations must also be tracked. Some
nations, such as Germany, have more extensive laws governing commerce than
other countries. In these cases, not only do the local divisions of a company need
to understand these laws, but other divisions that work with them should know
them as well. For example, if a manufacturer's facility in Ireland is planning on
increasing production, it could affect distribution centers in Germany. Will the
German center need to increase staff or make capital investments because of
limits on hours of business operations? While the details of such questions will
probably need to be addressed by the local division, understanding the operating
environment of business units in other nations could prove to be an advantage.

Of course, government regulations are not the only source of international docu- 
ments. Business-to-business dealings will generate plenty of contracts, agreements, 
and other documents that should be managed within the document warehouse. 

Meeting these and other needs will require that document warehouse designers 
and developers deal with two language-related issues: character sets and 
machine translation. 

Coded Character Sets

Coded character sets are used to represent alphabets in computers. Since digital
computers fundamentally represent all information using binary numbers,
characters need to be mapped into a numeric representation. The two most
commonly used are ASCII and Unicode. The former is the older of the two and
has long been used in countries with the Latin alphabet. Unicode was developed
in response to the need to represent other types of alphabets. Syllabaries
use a single character to represent a syllable, such as the kana system used in
Japanese to supplement the Chinese characters used in the language. Another
alphabet form is ideographic. These systems, such as Chinese, use a single text
element to represent an entire word. As Figure 8.1 depicts, modern-day writing
systems have evolved and branched off from a variety of earlier systems. The
design goal of Unicode was to represent any text element from any language;
consequently it is a much richer character set than ASCII.
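
The mapping from characters to numbers can be seen directly in most programming languages. The following Python sketch is only an illustration: the helper name `describe_character` is an assumption made for this example, while `ord` and `str.encode` are standard Python. It reports the code point assigned to a text element, whether that value fits in 7-bit ASCII, and how many bytes one Unicode encoding (UTF-8) needs to store it.

```python
def describe_character(ch: str) -> dict:
    """Return the numeric representation of a single text element."""
    code_point = ord(ch)  # the number the coded character set assigns to ch
    return {
        "character": ch,
        "code_point": code_point,
        "fits_7bit_ascii": code_point < 128,       # original ASCII range is 0..127
        "utf8_bytes": len(ch.encode("utf-8")),      # storage size in UTF-8
    }


if __name__ == "__main__":
    samples = [
        "A",        # Latin letter
        "\u00e9",   # Latin letter e with acute accent
        "\u304b",   # Japanese hiragana syllable "ka"
        "\u4e2d",   # Chinese ideograph "middle"
    ]
    for ch in samples:
        print(describe_character(ch))
```

Only the first sample fits in 7-bit ASCII; the syllabic and ideographic characters need a larger character set such as Unicode.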

ASCII Characters 

The American Standard Code for Information Interchange (ASCII) was originally
a 7-bit character set able to represent 128 letters, numbers, and symbols.
The current de facto standard is 8-bit ASCII, which can represent 256 characters.
The high-byte characters (from 128 to 255) do not have a standardized
character mapping and have been used for formatting features, such as italics,

Language identification is the first step in managing language issues. There are
several options in language translation, and, as usual, your choice depends
upon your particular needs. Finally, since translations yield new documents,
warehouse designers will need to specify how these new documents are treated
within the warehouse.



Language Identification 

The three main ways in which language is identified in document warehousing 
and other text mining operations are: 

■ Language identification programs
■ Search engine restrictions
■ Document metadata

Each method has its advantages, and all three can be used reliably in the docu- 
ment warehouse. 

We humans can quickly identify text written in our own language, even if we do
not understand the content. Take the following example from Gray's Anatomy:

The part of the choroid plexus seen in the descending cornu is formed in exactly
the same way, viz, by an ingrowth of the vessels of the pia mater into the cavity,
pushing the ependyma before it, at a part of the wall of the horn where there is a
similar absence of nervous tissue where it consists simply of pia mater and
ependyma in close contact (Henry Gray. 1977. The Classic Collector's Edition
Gray's Anatomy. New York: Bounty Books)

Although many of the terms are foreign to most of us, there are enough linguistic
clues to know that this is English. First, the Latin alphabet is used. Second,
common English words such as the, in, at, of, there, and where appear throughout
the passage. Finally, there are morphological clues. The word formed ends
in -ed, making it likely a past tense verb. It is closely followed by a word ending
in -ly, making that word a likely adverb and increasing the likelihood that the
word ending in -ed is in fact a verb. Just as humans can identify a language
without understanding the text, so can text analysis programs.

Language identification programs are generally trained with a sample set of
documents in a particular language. Using frequently occurring words and
character sequences, these programs can develop profiles of languages and
reliably identify a document's source language. The language identification tool
in the IBM Intelligent Miner for Text suite is preconfigured to identify 14
languages:

Brazilian
Catalan
Danish
Dutch
English
Finnish
French
German
Icelandic
Italian
Norwegian
Portuguese
Spanish
Swedish

The statistical techniques that are used with language identification tools
generally allow users to develop identification profiles for other languages. This
process usually entails creating training sets of documents in the target
language and running the language identification program in a training mode.
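
The idea of training a profile and then matching documents against it can be sketched in a few lines of Python. This is a deliberately small illustration and not the algorithm used by any particular product: the class name `LanguageIdentifier`, the profile size, and the one-sentence training samples are all assumptions made for the example, and a real tool would train on many documents and would typically use character sequences as well as words.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase words (letters only, any alphabet)."""
    return re.findall(r"[^\W\d_]+", text.lower())


class LanguageIdentifier:
    """Toy identifier based on each language's most frequent words."""

    def __init__(self, profile_size=20):
        self.profile_size = profile_size
        self.profiles = {}  # language name -> set of frequent words

    def train(self, language, sample_text):
        # Training mode: keep the most frequent words of the sample.
        counts = Counter(tokenize(sample_text))
        self.profiles[language] = {
            word for word, _ in counts.most_common(self.profile_size)
        }

    def identify(self, text):
        # Score each language by how many tokens appear in its profile.
        tokens = tokenize(text)
        scores = {
            language: sum(1 for token in tokens if token in profile)
            for language, profile in self.profiles.items()
        }
        return max(scores, key=scores.get)


identifier = LanguageIdentifier()
identifier.train(
    "english",
    "the cat is in the house and the dog is at the door of the garden "
    "where there is a tree",
)
identifier.train(
    "spanish",
    "el gato está en la casa y el perro está en la puerta del jardín "
    "donde hay un árbol",
)

print(identifier.identify("the vessels of the pia mater grow in at a part of the wall"))
print(identifier.identify("la casa del perro está en el jardín"))
```

Even though the anatomical terms in the first test sentence never appear in the training sample, the common function words are enough to pick the language, which mirrors the argument made above about the Gray's Anatomy passage.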

The second method for identifying languages is to take advantage of search
engine options to restrict searching to a specified language. All documents that
are returned from those searches are guaranteed to be in the selected language.
Of course, this technique does not help when dealing with internal documents,
but a similar principle applies. Extraction programs that collect documents
may be written to target servers where a single language predominates. For
example, a multinational firm can use different processes to collect documents
from its London, Rome, and Amsterdam sites so that documents in different
languages are kept partitioned before being loaded into the warehouse.

The third method is to use document metadata. The Dublin Core metadata
standard includes a language specification for the specified document. For
example, a fictional introduction to text mining might include the following
metadata specified by the Dublin Core and implemented using the Resource
Description Framework (RDF):

<?xml version="1.0"?>
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description>
    <dc:title> Introduction to Text Mining </dc:title>
    <dc:creator> Mary Jones </dc:creator>
    <dc:creator> Bob Smith </dc:creator>
    <dc:subject>
      Text Mining;
      Clustering;
      Summarization;
      Feature Extraction
    </dc:subject>
    <dc:publisher> Association of Text Miners </dc:publisher>
    <dc:date> 2000-08-15 </dc:date>
    <dc:format> text/html </dc:format>
    <dc:language> en </dc:language>
  </rdf:Description>
</rdf:RDF>

The XML element <dc:language> en </dc:language> identifies English as the
document's language. Now, it should be noted that the metadata could be
specified in a language other than that of the document. For example, the tag
<rdf:li xml:lang= > identifies the language of the metadata, allowing document
creators to describe the contents of the document in multiple languages, thus
aiding searchers using those other languages.
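
A loading program can read the language element with any XML parser. The sketch below uses Python's standard `xml.etree.ElementTree` module; the function name `document_language`, the abbreviated sample record, and the assumption that the record declares the Dublin Core namespace as `http://purl.org/dc/elements/1.1/` are all choices made for this illustration.

```python
import xml.etree.ElementTree as ET

# Namespace URI assumed for the dc: prefix in this example.
DC_NAMESPACE = "http://purl.org/dc/elements/1.1/"

SAMPLE_METADATA = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description>
    <dc:title> Introduction to Text Mining </dc:title>
    <dc:language> en </dc:language>
  </rdf:Description>
</rdf:RDF>"""


def document_language(rdf_xml):
    """Return the dc:language value of a metadata record, or None if absent."""
    root = ET.fromstring(rdf_xml)
    element = root.find(f".//{{{DC_NAMESPACE}}}language")
    if element is None or element.text is None:
        return None
    return element.text.strip()


print(document_language(SAMPLE_METADATA))
```

When the function returns `None`, the loader would have to fall back on one of the other two identification methods.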



Language Translation 

If language translation is supported in a document warehouse, additional
processing and flow of control support is required. As Figure 8.2 shows, once
documents have reached a staging area the following steps must be performed:

■ Identify the language of the document.
■ If the language is to be translated, determine if manual or machine
  translation will be used.
■ If manual translation is selected, add the document to the manual
  translation queue for the document's language.
■ If machine translation is selected, execute the translation program.
■ If the full document is to be stored, add it to the document stream for
  further processing.
■ If only a summary of the translated document will be stored, execute a
  summarization program and add the summary to the document stream for
  further processing.

Figure 8.2 Translation adds several steps to processing the document stream.
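
The flow of control in these steps can be sketched as a small routing routine. Everything named here is hypothetical: `Document`, `TranslationRouter`, and the `identify`, `translate`, and `summarize` callables stand in for whatever language identification, machine translation, and summarization tools a warehouse actually uses.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Document:
    doc_id: str
    text: str
    language: str = ""


@dataclass
class TranslationRouter:
    target_language: str = "en"
    manual_languages: set = field(default_factory=set)  # languages sent to human translators
    summary_only: bool = False                           # store only a summary of translations
    manual_queues: dict = field(default_factory=lambda: defaultdict(list))
    document_stream: list = field(default_factory=list)

    def route(self, doc, identify, translate, summarize):
        # Step 1: identify the language of the document.
        doc.language = identify(doc.text)
        if doc.language == self.target_language:
            self.document_stream.append(doc)
            return "stream"
        # Steps 2-3: manual translation goes to a per-language queue.
        if doc.language in self.manual_languages:
            self.manual_queues[doc.language].append(doc)
            return "manual queue"
        # Step 4: otherwise run the machine translation program.
        text = translate(doc.text, doc.language, self.target_language)
        # Steps 5-6: store the full translation or only its summary.
        if self.summary_only:
            text = summarize(text)
        self.document_stream.append(Document(doc.doc_id, text, self.target_language))
        return "stream"


# Stub tools, used only to exercise the routing logic.
def identify(text):
    words = text.split()
    return "de" if "und" in words else "fr" if "le" in words else "en"

def translate(text, source, target):
    return f"[{source}->{target}] {text}"

def summarize(text):
    return text[:12]

router = TranslationRouter(manual_languages={"de"}, summary_only=True)
print(router.route(Document("d1", "Vertrag und Anlage"), identify, translate, summarize))
print(router.route(Document("d2", "le contrat est long"), identify, translate, summarize))
```

Keeping the tools as parameters means the same routing logic works whether translation is done by a local program or an external service.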

Of all these steps, the choice of manual versus machine translation is perhaps
the most important. Manual translation offers the best quality translation but is
slower and significantly more costly than automatic translation. Machine
translation is faster, but the reader will generally get only the gist of the
document without a thoroughly accurate translation.

An alternate methodology to the one described above is to store documents in
their native language, translate queries, and then provide summaries in the
language of the query. If the document is of sufficient interest to the reader, then it
can be completely translated. This approach is most appropriate when automated
translation is of insufficient quality, and the cost of translating a large
volume of documents is prohibitive.

Manual Translation 

Manual translation is sometimes the best option for ensuring high-quality
translation in the document warehouse. Machine-aided translation (MAT)
provides some automated support for humans through the use of online
dictionaries, morphological analysis, and other text processing tools. In the case
of MAT, human translators can increase productivity while still controlling
quality.

Another option with regard to manual translation is to let a translation program 
make a first pass at the translation, and then have the human translator finalize 
the translation. Again, the final quality assurance measures rest with the human 
translator. As Figure 8.3 shows, there are different options for configuring a 
manual translation environment. 

Figure 8.3 Manual translation can be done independently (a) or with the use of
machine-aided translation tools (b).

Machine Translation

The early and persistently elusive goal of machine translation is fully automatic
high-quality translation (FAHQT). The ideal translation system works
independently of humans yet produces translations at least as good as a
human translator. To accomplish this task, the translation system must deal
with ambiguity, polysemy, and idioms, among other challenges. Needless to
say, we have not yet achieved FAHQT. What has been discovered, however, is
that there is a definite tradeoff between the complexity of the translation
system and the quality of the translation. Three general approaches, in increasing
level of complexity, are:

■ Direct translation
■ Transfer approach
■ Interlingua approach

Direct translation was the earliest, and continues to be the most common,
design strategy. The transfer approach and interlingua approach were both
developed to overcome limitations of earlier approaches.

Direct translation uses a word-for-word approach to translation. As Figure 8.4
shows, a text in a source language is mapped to a target language, using a
bilingual dictionary and basic rules for reordering phrases.



Figure 8.4 Direct translation is the most basic translation technique, using only a
dictionary and some rewrite rules.

Although fast and efficient, direct translation does not perform any analysis of
content or try to resolve ambiguities. It has been successfully used with
languages that have similar grammatical structures, for example, English and
Spanish. One study of a commercial machine translation program (Gimenez
and Forcada 1998) found the substitution rules in Table 8.1 were used in an
English-to-Spanish translation program.

Table 8.1 English-to-Spanish Substitution Rules
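
A minimal sketch of the direct approach makes its simplicity, and its limits, concrete. The tiny dictionary, the part-of-speech sets, and the single reordering rule (adjective followed by noun becomes noun followed by adjective) are assumptions invented for this illustration; they are not the substitution rules from the study cited above.

```python
# Toy bilingual dictionary (English -> Spanish); entries are illustrative only.
DICTIONARY = {
    "the": "el",
    "red": "rojo",
    "car": "coche",
    "is": "es",
    "fast": "rápido",
}

# Minimal word-class knowledge needed by the single reordering rule.
ADJECTIVES = {"red", "fast"}
NOUNS = {"car"}


def direct_translate(sentence):
    """Translate word for word, after applying one phrase-reordering rule."""
    words = sentence.lower().split()

    # Reordering rule: "adjective noun" in English -> "noun adjective" in Spanish.
    reordered = []
    i = 0
    while i < len(words):
        if i + 1 < len(words) and words[i] in ADJECTIVES and words[i + 1] in NOUNS:
            reordered.extend([words[i + 1], words[i]])
            i += 2
        else:
            reordered.append(words[i])
            i += 1

    # Dictionary lookup; words without an entry pass through untranslated.
    return " ".join(DICTIONARY.get(word, word) for word in reordered)


print(direct_translate("The red car is fast"))
```

Nothing in this routine looks at meaning: a word with two senses always receives the same dictionary entry, which is exactly the weakness the transfer and interlingua approaches try to address.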

The transfer approach improves upon the basic dictionary lookup philosophy
of the direct approach by adding syntactic analysis. As Figure 8.5 shows, the
second step of this method, the transfer, uses syntactic rules to determine
the sentence structure of the target sentence. In the final phase, morphological
rules are applied to create the final word forms in the target language, and
grammar rules are applied to determine the appropriate phrase and sentence
structures. Like the direct approach, the transfer method works only on a single
sentence at a time and does not perform semantic analysis. The technique
with the most emphasis on semantics is the interlingua approach.

The interlingua approach uses a special language-neutral stage, called the
interlingua. The purpose of the interlingua is to act as a universal semantic
representation scheme. Rather than mapping words from the source language
onto words in the target language and then rearranging word order according
to the grammar of the target language, the interlingua represents the meaning
of the source text and then synthesizes text in the target language. Figure 8.6
shows the basic structure of the interlingua approach.




Figure 8.5 The transfer approach includes syntactic analysis and source-to-target
transfer rules.

Although theoretically appealing, this method has not yielded significant
results (Hausser 1998). The most significant problem has been finding a
suitable interlingua. Proposals include a logic-based language, an artificial
language such as Esperanto, and a set of semantic primitives. The use of both
logic-based languages and semantic primitives has been extensively studied
in Artificial Intelligence (AI), but a comprehensive representation scheme
built on these approaches has yet to be developed.

Partial Translations of Structured Texts 

In addition to translating entire texts, semistructured texts lend themselves to
partial translations. For example, in a financial report it may be sufficient to
translate only column and row headings used in tables of financial data. For
longer documents, abstracts or executive summaries could be translated by
machine to provide the main gist of the document to the reader, while leaving
the rest of the text to be translated only if there is a specific need.



Limits of Machine Translation 

Whether you choose full or partial translation, there are limits to the quality of 
the translation that end users should be aware of. 

First, not all terms used in a source document will have entries in the bilingual 
dictionary. This is especially true for scientific and technical terms. Many soft- 
ware packages do, however, allow users to add additional terms for specialized 
vocabularies. 

Second, outside of restricted language domains, translation requires some 
semantic understanding. For example, machine translation systems have been 
successfully used in Canada to translate weather forecasts, a domain with a 
limited scope, from English to French. A similar attempt to develop a machine 
translation system for aviation hydraulics was abandoned after three years 
(Klein 1999). 




Figure 8.6 Semantic representation in a common meaning representation scheme
distinguishes the interlingua approach from the direct and transfer methods.

A third problem with machine translations is known as lexical gaps. This
occurs when one language has a single word that can be translated into two
or more words in a target language. For example, the English word know
translates into both savoir and connaître in French and into wissen and
kennen in German. Determining the correct translation requires a semantic
understanding not usually found in current machine translation systems. One
approach to dealing with this problem is to allow a user to choose among
alternate translations.
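
Letting the user choose among alternates is easy to support if the bilingual dictionary stores a list of candidates for each source word. The following sketch assumes a hypothetical `translate_word` function and a two-entry dictionary; the `choose` callback stands in for whatever user interface presents the alternatives.

```python
# Each source word maps to one or more candidate translations (English -> French).
BILINGUAL_DICTIONARY = {
    "know": ["savoir", "connaître"],  # a lexical gap: two target words
    "weather": ["temps"],
}


def translate_word(word, dictionary, choose=None):
    """Translate one word, deferring to `choose` when several candidates exist."""
    candidates = dictionary.get(word.lower(), [])
    if not candidates:
        return word  # no dictionary entry: leave the source term untranslated
    if len(candidates) == 1 or choose is None:
        return candidates[0]  # unambiguous, or no one to ask: take the first entry
    return choose(word, candidates)


def pick_last(word, candidates):
    """Stand-in for a user who selects the last alternative offered."""
    return candidates[-1]


print(translate_word("know", BILINGUAL_DICTIONARY, choose=pick_last))
print(translate_word("weather", BILINGUAL_DICTIONARY))
print(translate_word("hydraulics", BILINGUAL_DICTIONARY))
```

The third call also illustrates the first limit described above: a technical term with no dictionary entry simply passes through untranslated.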



Document Storage Options 

Translated documents add another dimension of complexity to the document
warehouse. We now have multiple versions of the same document in the sense
that each translation conveys the same semantic content. Some of the options
that warehouse designers have are:

■ Storing the original text and all translations
■ Storing the full text of the original and only a summary of the translated
  document
■ Storing only a summary of all versions, including the original
■ Storing the translation but not the original

How you choose between these and other options depends upon the impor- 
tance of the document, the cost of retranslating if necessary, and any potential 
quality control and legal issues. 

■ The simplest solution, from a design perspective, is to store the full text
  of the original document as well as all translations. The advantage of this
  choice is having the original on hand in case there are questions about the
  translation at a later time. It also provides translations of the full text, so
  readers do not need to have the entire document translated, as they would
  if only a translated summary were stored.

■ Storing the full text of the original and translations of the document
  summary is another option. In this scenario, space use is optimized, and
  readers can still get the gist of a document from the summary. Since only
  summaries are translated, this could reduce the translation load by as
  much as 80 percent. Similarly, the original document may not be important
  enough to warrant storing the full content in the warehouse, and in this
  case, a summary-only scenario is appropriate.

■ Finally, having news stories, press releases, and other noncritical
  documents in the original language may not add any substantive value to a
  document warehouse. A mobile phone manufacturer announcing a new line of
  products in Finnish may not need to be included in the warehouse when an
  English version meets the needs of end users.

■ In general, the most important documents, such as contracts, legal
  opinions, and government regulations and reports, warrant multiple
  language versions in a document warehouse, especially when machine
  translation techniques are used. Readers may get the main point of a
  document from a machine-translated rendition, but details and subtle points
  can easily be missed or misrepresented in an automatically generated
  translation. Because of these limitations, it is important for users to
  understand how the document was translated and any shortcomings of
  that method.



Indexing Text



Indexing text allows us to efficiently search for documents relevant to a query 
without examining entire documents. In this way, text indexing is similar to 
conventional database indexes, which allow us to forgo full table scans for 
more efficient retrieval of rows of data. The two types of indexing we are pri- 
marily concerned with are full text indexing and thematic indexing. 



Full Text Indexing

Full text indexing occurs automatically in many text analysis tools when docu- 
ments are loaded. Indexes will generally record information about the location 
of terms within the text so that proximity operators, as well as Boolean opera- 
tors, can be used in full text queries. The most common operators in text 
queries are: 

■ Boolean Operators
    ■ AND
    ■ OR
    ■ NOT
■ Proximity Operators
    ■ NEAR
    ■ WITHIN

Indexing supports Boolean operators by allowing those operations to be
performed on the indexes without full document searching. Since the indexes
maintain information about the position of words within a text, proximity
operators can also make use of indexes. The NEAR operator can be implemented in
several ways. First, it can return a score relative to the distance between two
specified terms. It can also be limited to searching for pairs of words with a
maximum number of words between them. Finally, the order of terms in a
NEAR query may or may not be relevant.

To maintain the efficient implementation and use of full text indexes, frequently
used words, called stop words, are often ignored. Stop words appear so
commonly in discourse that they do not add any value to document searching and
can be safely ignored.
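
The following sketch shows one way a positional index can answer Boolean and proximity queries without rescanning documents. The class name `PositionalIndex` and its method names are assumptions for this illustration, and the NEAR variant shown is only one of the options described above: it ignores term order and accepts a pair when at most `max_between` words separate the two terms.

```python
import re
from collections import defaultdict


class PositionalIndex:
    """Inverted index that records the word positions of each term."""

    def __init__(self, stop_words=()):
        self.stop_words = set(stop_words)
        # term -> document id -> list of word positions
        self.postings = defaultdict(lambda: defaultdict(list))

    def add(self, doc_id, text):
        # Positions are counted before stop words are dropped, so distances
        # between the remaining terms still reflect the original text.
        for position, term in enumerate(re.findall(r"\w+", text.lower())):
            if term not in self.stop_words:
                self.postings[term][doc_id].append(position)

    def docs(self, term):
        return set(self.postings.get(term.lower(), {}))

    def and_query(self, a, b):
        return self.docs(a) & self.docs(b)

    def or_query(self, a, b):
        return self.docs(a) | self.docs(b)

    def not_query(self, a, b):
        """Documents containing a but not b."""
        return self.docs(a) - self.docs(b)

    def near(self, a, b, max_between):
        """Documents where a and b occur with at most max_between words between them."""
        hits = set()
        for doc_id in self.and_query(a, b):
            first = self.postings[a.lower()][doc_id]
            second = self.postings[b.lower()][doc_id]
            if any(abs(x - y) - 1 <= max_between for x in first for y in second):
                hits.add(doc_id)
        return hits


index = PositionalIndex(stop_words={"the", "of"})
index.add(1, "The mortgage rate rose")
index.add(2, "The rate of the new mortgage")

print(index.and_query("mortgage", "rate"))   # both documents contain both terms
print(index.near("mortgage", "rate", 0))     # only document 1 has them adjacent
```

Because stop words never enter the postings, a query on a stop word such as "the" returns no documents at all, which is the trade-off accepted for a smaller index.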

Some tools, such as Megaputer's Text Analyst, automatically calculate the
lexical affinities that measure the co-occurrence of words. Words that appear
together, such as real estate, mortgage rate, and data warehousing, will be
identified as lexical affinities. Using lexical affinity measures can improve full text
searching by helping to disambiguate words with multiple meanings, such as
bed in flower bed or queen-sized bed.
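
As a rough illustration of co-occurrence counting, the sketch below tallies adjacent word pairs and keeps those that recur. It is an assumption-laden simplification, not the method used by any commercial tool: the function name `lexical_affinities`, the restriction to adjacent words, and the `min_count` threshold are all choices made for this example.

```python
import re
from collections import Counter


def lexical_affinities(texts, min_count=2):
    """Return adjacent word pairs that occur at least min_count times."""
    pair_counts = Counter()
    for text in texts:
        words = re.findall(r"[a-z]+", text.lower())
        # zip pairs each word with its immediate successor.
        pair_counts.update(zip(words, words[1:]))
    return {pair: count for pair, count in pair_counts.items() if count >= min_count}


sample_texts = [
    "real estate prices rose",
    "the real estate market",
    "estate planning",
]
print(lexical_affinities(sample_texts))
```

Here "real estate" is reported as an affinity because it recurs, while one-off pairs such as "estate planning" are filtered out.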

Thematic Indexing

Thematic indexing depends upon the use of thesauri. A thesaurus is a set of
terms that define a vocabulary and are organized using relationships. It
provides a hierarchical structure that allows text mining tools to quickly find
generalizations as well as specializations of specific terms. The ISO-2788 standard
for monolingual thesauri is the most commonly used standard for thematic
indexing and consists of four main components:

■ Thesaurus
■ Indexing term
■ Preferred term
■ Nonpreferred term

Indexing terms are either a single word or a compound term representing a
concept in the thesaurus. Preferred terms are the terms used when indexing
concepts. For example, automobile could be the preferred term for car, van, and
minivan, which are considered nonpreferred terms.

Preferred terms are organized hierarchically. Nonpreferred terms are tied to the
hierarchical structure by their reference to a preferred term. Preferred terms
are related to each other by relations that define the hierarchy. The ISO-2788
standard defines the following relations:

■ USE: The term that follows is a preferred term.
■ UF: The term that follows is used for a preferred term.
■ Top term: This specifies the name of the broadest class to which a term belongs.
■ BT: This defines a broader, generalized term for a specified word.
■ NT: This defines a narrower term that specifies another term.
■ RT: The related term operation associates words that are not synonyms or
  quasisynonyms of a given term.

Some tools, such as Oracle interMedia Text, are preconfigured with a thesaurus
and can be used immediately to thematically index text. Other tools, and
some text mining applications, will require custom thesauri. Figure 8.7 shows
a sample thesaurus using standard terms and relations.

With a thesaurus, applications will be able to search by topic as well as by full
text. It is highly recommended that all document warehouses provide this basic
service. Thematic indexing reduces the poor precision and poor recall associated
with polysemy (words having multiple meanings) and synonymy (multiple words
for the same meaning).



Company
    NT Corporation
    NT Sole Proprietorship
    NT Limited Liability Partnership
    SYN For-profit Organization
Organization
    NT Company
    NT Non-profit Organization
    NT Government
Government
    NT Federal Government
    NT Regional Government
    NT Municipal Government
Regional Government
    SYN State Government
    SYN Provincial Government

Figure 8.7 A sample thesaurus in ISO-2788 format.
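
The sample thesaurus can be held in two small lookup tables, one for the NT relation and one for synonyms, which is enough to find specializations and generalizations of a term. The sketch below is an illustration only: the table layout and the function names `preferred`, `specializations`, and `generalizations` are assumptions, and the data is simply the content of Figure 8.7.

```python
# NT relation from Figure 8.7: preferred term -> narrower terms.
NARROWER = {
    "Organization": ["Company", "Non-profit Organization", "Government"],
    "Company": ["Corporation", "Sole Proprietorship", "Limited Liability Partnership"],
    "Government": ["Federal Government", "Regional Government", "Municipal Government"],
}

# SYN entries from Figure 8.7: nonpreferred term -> preferred term.
SYNONYMS = {
    "For-profit Organization": "Company",
    "State Government": "Regional Government",
    "Provincial Government": "Regional Government",
}


def preferred(term):
    """Map a nonpreferred term to its preferred term."""
    return SYNONYMS.get(term, term)


def specializations(term):
    """All narrower terms below a term, found by following NT links."""
    found = []
    for child in NARROWER.get(preferred(term), []):
        found.append(child)
        found.extend(specializations(child))
    return found


def generalizations(term):
    """All broader terms above a term, nearest first."""
    term = preferred(term)
    for parent, children in NARROWER.items():
        if term in children:
            return [parent] + generalizations(parent)
    return []


print(specializations("Government"))
print(generalizations("State Government"))
```

A thematic query for Government can then be expanded to include documents indexed under any of its narrower terms.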
Document Classification



Full text and theme indexing are usually implemented to support ad hoc search- 
ing, but they also provide the basis for document classification. By looking at the 
pattern of words and themes, we can develop a rough partitioning of documents 
into a predefined set of groups. Examples of such groups include: 

■ Industry sector news stories
■ Regulatory notices
■ Project-related documents
■ Product-specific technical documentation and manuals
■ Financial reports

These rough partitions can be further refined as necessary. For most document
warehouses, two types of classifications will be used: labeling and
multidimensional taxonomies.



Labeling

Labeling is the process of assigning a dominant theme or topic descriptor to
a document. The labels chosen may be domain dependent or drawn from general
categories, such as the ones found in Oracle interMedia Text's knowledge
catalog. For finer classifications, multiple labels can be assigned along with
weights. For each document, a list of labels and weights is assigned:

$$\text{Document} \rightarrow [(label_1, weight_1), (label_2, weight_2), \ldots, (label_n, weight_n)]$$

The labels and weights can be used with text querying tools to specify minimum
thresholds when looking for documents. For example, the following code can
be used to return the document identifier and title of documents about
currency exchanges with at least a weighting of 0.5:

SELECT
    Document_id, Title
FROM
    Documents
WHERE
    ABOUT(text_column, 'currency exchange') > 0.5

Labels without weights can also be used to populate the SUBJECT field of the
Dublin Core metadata set kept for the document. Ideally, the document creator
would specify subject labels, but if these labels do not exist or do not conform to
the preferred terms used in the document warehouse thesauri, then classification
labels can be assigned.
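
Outside a database, the same threshold query can be expressed over the label and weight lists directly. In this sketch the document identifiers, labels, weights, and the function name `find_by_label` are all invented for illustration; the comparison uses "at least" the threshold, following the wording above.

```python
# Each document carries a list of (label, weight) pairs.
DOCUMENT_LABELS = {
    "doc-101": [("currency exchange", 0.8), ("central banks", 0.4)],
    "doc-102": [("currency exchange", 0.3), ("tourism", 0.7)],
    "doc-103": [("mergers", 0.9)],
}


def find_by_label(document_labels, label, min_weight):
    """Return ids of documents whose weight for `label` is at least min_weight."""
    return sorted(
        doc_id
        for doc_id, labels in document_labels.items()
        if any(name == label and weight >= min_weight for name, weight in labels)
    )


print(find_by_label(DOCUMENT_LABELS, "currency exchange", 0.5))
```

Lowering the threshold to 0.3 would also return doc-102, where currency exchange is only a secondary theme.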



Labeling: How It Works

Successful labeling depends upon three types of data:

■ Word frequency statistics
■ Morphological knowledge
■ Type-specific terms

Two word frequency statistics are necessary: relative frequency and absolute
frequency. Relative frequency measures the number of times a word appears in
a document. Absolute frequency measures the number of times a term appears
in a set of documents. Depending on the classification tool, absolute frequency
might be calculated over a broad range of documents and the statistics
provided along with the tool. In other cases, tools can be trained by using sample
documents provided by document warehouse designers and text miners.

Morphological knowledge is used to determine the preferred (or canonical)
representation of a term. For example, the canonical form of eat, ate, eats, and
eating is eat. Morphology is used to eliminate the variations that occur in language
such as tense, plurality, and, in some languages, noun declensions and verb
conjugations. So no matter how the root of the word is modified to meet the
grammatical rules of the source language, it will be identified as a single term.

Type-specific terms are used to augment general lexicons and thesauri. These
extra terms include names of cities, states, provinces, and other geographic
references as well as common abbreviations, names of clients and customers, and
other company- or domain-specific terms.

Once morphological analysis renders words into a standard canonical form,
relative frequencies can be calculated. The most common measure for determining
the weight of a term in a document is the inverse document frequency measure.
The basic idea behind the measure is that high weights should be assigned to
terms that appear in few documents because these are good discriminators. Since
relative frequency measures the number of times a word appears in a document,
its weight will be proportional to the relative frequency. Terms that appear in
many documents have a high absolute frequency and indicate poor
discriminators. In these cases, the weight is inversely proportional to the absolute frequency.

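
One common way to combine the two statistics is to multiply the count of a term within a document by the logarithm of the inverse of the fraction of documents that contain it. The sketch below uses that variant as an assumption; classification tools differ in the exact formula, and here "absolute frequency" is approximated by the number of documents containing the term. The function name `term_weights` and the sample corpus are invented for the example, and the terms are assumed to be in canonical form already.

```python
import math
from collections import Counter


def term_weights(documents):
    """Weight each term by (count in document) * log(N / documents containing term).

    documents: dict mapping a document id to its list of canonical terms.
    """
    n_docs = len(documents)

    # Number of documents in which each term appears at least once.
    doc_freq = Counter()
    for terms in documents.values():
        doc_freq.update(set(terms))

    weights = {}
    for doc_id, terms in documents.items():
        counts = Counter(terms)  # frequency of each term within this document
        weights[doc_id] = {
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return weights


corpus = {
    "d1": ["currency", "exchange", "rate", "currency"],
    "d2": ["interest", "rate"],
    "d3": ["rate", "policy"],
}
weights = term_weights(corpus)
print(weights["d1"])
```

In this corpus the term rate appears in every document, so its weight is zero, while currency, which appears twice in one document and nowhere else, receives the highest weight.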
The combination of terms and weights has proven to be a powerful technique
for classifying documents. One limitation of labeling, though, is that it does not
generalize. For example, a document labeled automobiles, trucks, or buses or rail
transportation is also about ground transportation, but one cannot easily query
for generalized concepts such as ground transportation.



Multidimensional Taxonomies

The idea of a multidimensional classification structure is well known to data
warehouse practitioners. Ralph Kimball and others have developed the
multidimensional model into an effective tool for organizing large quantities of
numeric data in data warehouses. Multidimensional models allow us to quickly and
easily target a subset of the database that interests us, using major structural
categories, such as time period, customer, product, and location. Similarly, with
multidimensional taxonomies, we can quickly and easily target a subset of a
document set by using classification categories, as shown in Figure 8.8.



Figure 8.8 A partial sample taxonomy for classifying a broad range of documents.

Taxonomies can be created using specialized taxonomy-generation tools or
with a combination of clustering and feature extraction, as described in Muller
et al. (1999). Given a taxonomy, we can classify documents with both specific
terms, as in the case of labeling, and hierarchical categories. The net effect is
equivalent to drilling up hierarchies in a multidimensional data warehouse.
Thus, we can query for documents about ground transportation and find
documents about automobiles and rail transportation.

When dealing with taxonomies, it is useful to distinguish two concepts: the
intension of a term and the extension of the term. The intension of a term
describes the term abstractly, by relating it to other abstract terms. For example,
automobile is a type of ground transportation. The extension of a term is the set
of documents (in our case) that are about that particular term. For example, the
extension of the term automobile might be documents with document IDs 1001,
2387, 11183, and 93321. The extension, thus, points to actual documents which
instantiate the concept of automobile. With these definitions in hand, we can
now proceed to discuss how multidimensional taxonomies are used within the
framework of the document warehouse.

From the classification problem perspective, multidimensional hierarchies
classify particular documents (the extension) into multiple categories at multiple
levels of generality (the intension), thus providing a richer classification
scheme than labeling alone.
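
A minimal sketch may make the two sides of a taxonomy concrete: one table records how terms relate to more general terms, and another records the extension of each term as a set of document IDs. The taxonomy itself is hypothetical; the automobile IDs are taken from the example above, and the rail transportation ID is invented.

```python
# Abstract side: each term is related to a more general term (hypothetical taxonomy).
PARENT = {
    "automobile": "ground transportation",
    "rail transportation": "ground transportation",
    "ground transportation": "transportation",
}

# Extension: the document IDs labeled with each specific term.
# The automobile IDs come from the example in the text; the rail ID is invented.
EXTENSION = {
    "automobile": {1001, 2387, 11183, 93321},
    "rail transportation": {4410},
}


def is_a(term, category) -> bool:
    """True if term equals category or lies below it in the taxonomy."""
    while term is not None:
        if term == category:
            return True
        term = PARENT.get(term)
    return False


def documents_about(category: str) -> set[int]:
    """Drill up: collect the extensions of every term under the category."""
    found: set[int] = set()
    for term, doc_ids in EXTENSION.items():
        if is_a(term, category):
            found |= doc_ids
    return found


print(sorted(documents_about("ground transportation")))
```

A query for ground transportation returns documents labeled automobile as well as those labeled rail transportation, which is the generalization that labeling alone cannot provide.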



Document Clustering



Document clustering may be useful for some applications, such as quickly find- 
ing similar documents and exploring the macrostructures of a large collection 
of documents. Clustering can also help identify duplicate documents in the 
warehouse so they may be removed. Unlike classification, clustering does not
presume a preexisting set of terms or a taxonomy that is used to group
documents. Instead, groups are created on the basis of the features of documents
within the set of documents being clustered. Although this technique is not as 
common as thematic indexing or summarization, it may prove useful to some. 

Many techniques have been developed for document clustering, but we will
concentrate on three main types:

• Binary relational clustering
• Hierarchical clustering
• Self-organizing maps (SOM)

Binary relational clustering partitions a set of documents into groups, with each
document in exactly one group. Hierarchical clustering groups documents at
multiple levels, providing drill-up and drill-down navigation. Self-organizing maps
are especially useful for document sets covering a broad range of topics, such as
e-mails.



Binary Relational Clustering 

Like other clustering techniques, the main objective of binary clustering is to
group documents so that the similarity between documents in a cluster
is maximized, while the similarity between documents in different clusters
is minimized. The dominant features of binary relational clustering are:

• Clusters are flat.
• Documents are in only one cluster.
• Clusters correspond to a single topic.

Binary relational clustering works by assigning documents to a single cluster,
much as labeling assigns one classification to a document. Like labeling, each
cluster corresponds to one topic, which is basically the set of common features
shared by all documents in a cluster. For example, a cluster with documents about
Windows NT, Windows 96, DOS, VMS, UNIX, and Linux corresponds to an operating
system cluster. As Figure 8.9 shows, binary relational clustering groups
documents on the basis of a similarity threshold and a predefined number of
clusters, and does not guarantee a balanced distribution of documents over all
clusters.
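
The text does not name a specific algorithm, so the following is only one possible sketch of a flat, single-assignment clustering driven by a similarity threshold and a cap on the number of clusters. The single-pass strategy, the Jaccard similarity measure, and all names are assumptions made for illustration.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Share of terms two sets have in common (0.0 to 1.0)."""
    return len(a & b) / len(a | b) if a | b else 0.0


def flat_clusters(docs: dict[str, set[str]], threshold: float, max_clusters: int) -> list[list[str]]:
    """Assign every document to exactly one cluster in a single pass."""
    clusters: list[list[str]] = []  # document IDs per cluster
    features: list[set[str]] = []   # terms seen so far in each cluster
    for doc_id, terms in docs.items():
        scores = [jaccard(terms, feats) for feats in features]
        best = max(range(len(scores)), key=scores.__getitem__, default=None)
        # Join the best cluster if it is similar enough, or if no new cluster is allowed.
        if best is not None and (scores[best] >= threshold or len(clusters) >= max_clusters):
            clusters[best].append(doc_id)
            features[best] |= terms
        else:
            clusters.append([doc_id])
            features.append(set(terms))
    return clusters


docs = {
    "doc1": {"unix", "kernel", "shell"},
    "doc2": {"linux", "kernel", "shell"},
    "doc3": {"inflation", "interest", "rates"},
    "doc4": {"interest", "rates", "debt"},
}
print(flat_clusters(docs, threshold=0.3, max_clusters=2))
```

Because a document always joins its best match once the cluster limit is reached, nothing in this scheme keeps the clusters balanced in size.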

Hierarchical Clustering

Hierarchical clustering groups documents together according to similarity mea- 
sures in a tree structure. As Figure 8.10 shows, documents can be in multiple 
clusters in a hierarchical clustering scheme. Rather than finding the single best 
match between a document and a cluster, hierarchical clustering algorithms 
iteratively group documents into larger clusters.

The basic algorithm works as follows: First, assign each document to its own 
cluster. These are the leaves of the tree. Then create the second level of the tree 
by merging two clusters at a time, grouping them according to similarity. Create 
the third level by grouping pairs of clusters from the second level, and so on, until 
all groups have been merged into a single cluster at the root of the hierarchy. 
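
The merging procedure can be sketched as follows. This is a simplified agglomerative variant that merges the single most similar pair of clusters at each step rather than building whole levels at once; the Jaccard similarity measure and the nested-tuple tree representation are assumptions, not something the text prescribes.

```python
from itertools import combinations


def similarity(a: set[str], b: set[str]) -> float:
    """Jaccard overlap between two term sets."""
    return len(a & b) / len(a | b) if a | b else 0.0


def build_hierarchy(docs: dict[str, set[str]]):
    """Merge the two most similar clusters until a single root remains.

    Returns the tree as nested tuples whose leaves are document IDs.
    Expects at least one document.
    """
    # Each cluster is (subtree, combined terms); leaves are single documents.
    clusters = [(doc_id, set(terms)) for doc_id, terms in docs.items()]
    while len(clusters) > 1:
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda pair: similarity(clusters[pair[0]][1], clusters[pair[1]][1]),
        )
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] | clusters[j][1])
        # Drop the two merged clusters and add their parent.
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]


docs = {
    "doc1": {"unix", "kernel"},
    "doc2": {"linux", "kernel"},
    "doc3": {"interest", "rates"},
}
print(build_hierarchy(docs))
```

The two operating system documents are merged first, and the unrelated document joins them only at the root, so walking the nested tuples from the root down corresponds to a drill-down operation.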

One of the advantages of hierarchical clustering is that it supports browsing by
drill-down and drill-up operations.



Self-Organizing Map Clustering 



Figure 8.9 Binary relational clustering renders a flat partitioning of a set of documents.

A third clustering technique uses a neural network to map documents in document
sets that have many possible topics (that is, the document space is high
dimensional), where each document has only a small number of those topics
(that is, the document space is sparse).

Figure 8.10 Hierarchical clustering groups documents into multiple clusters.

Like the other clustering techniques, self-organizing maps (SOMs) depend upon
a similarity measure. Unlike the other techniques, which compare documents to
each other, SOMs compare the similarity of a document to a point on a
two-dimensional grid, as depicted in Figure 8.11. The grid is created initially and
populated with weighted feature vectors. The similarity measure compares the
distance between a document and the feature vector corresponding to each
point on the grid. After finding the closest match, the algorithm adjusts
the weights of the feature vector at that grid point to move it a little closer to
the document just added to the cluster. The amount of adjustment is controlled
by a rate of learning parameter specified when the clustering program is run.
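
A single placement step of this kind might look like the sketch below. It follows the description above and adjusts only the winning grid point; complete self-organizing map implementations typically also adjust neighboring grid points and reduce the learning rate over time. The function names, the Euclidean distance measure, and the tiny hand-initialized grid are assumptions for illustration.

```python
import math


def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def som_step(grid, doc_vector, learning_rate):
    """Place one document on the grid and adapt the winning feature vector."""
    # Find the grid point whose feature vector is closest to the document.
    winner = min(grid, key=lambda point: distance(grid[point], doc_vector))
    # Move that feature vector a little closer to the document.
    grid[winner] = [w + learning_rate * (x - w) for w, x in zip(grid[winner], doc_vector)]
    return winner


# A 1 x 2 grid with hand-picked starting weights (real grids are larger and
# usually start from random weights).
grid = {(0, 0): [0.0, 0.0], (0, 1): [1.0, 1.0]}
winner = som_step(grid, [1.0, 0.8], learning_rate=0.5)
print(winner, grid[winner])
```

A larger learning rate moves the winning vector further toward each new document, so early documents are forgotten more quickly.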

SOMs have been used successfully with large newsgroup collections, which are 
generally considered difficult to analyze because they are frequently filled with 
short, ungrammatical pieces of text (Kohonen 1998).

Clustering techniques will prove useful when trying to understand the overall
structure of a document set and for some maintenance operations, such as 
detecting duplicates. 



Figure 8.11 Self-organizing maps use two-dimensional grids to organize the clustering of documents.



Summarizing Text



The goal of summarization is to reduce the length and complexity of a document
while maintaining its meaning. The two basic methods of summarization
are summarization by abstraction and summarization by extraction. When
humans summarize, we generally read the entire text, develop an understanding
of the main ideas, and then write a coherent summary of the text. This is
summarization by abstraction and is beyond the abilities of automated methods.
Summarization by extraction, by contrast, creates a summary without
understanding the meaning of the text. Three approaches to summary by
extraction have been proposed:

• Paragraph extraction
• Sentence extraction
• Sentence segment extraction



Each technique has distinct advantages, as we shall now discuss.



Basic Summarization Methods



All three methods determine the most important terms using the same techniques
used for document classification [...]. In the case of paragraph extraction,
entire paragraphs are extracted [...]. The primary advantage of this method is
that it is the most rhetorically coherent of the three approaches.

Sentence extraction works similarly, but at the sentence level instead of the
paragraph level. In this approach, less text is retained in the summary since
unimportant sentences within a paragraph are discarded. Sentences are usually
ordered according to their relative weights, so it is not uncommon to lose
rhetorical consistency. For example, a sentence that begins with In conclusion
... could appear before a sentence that begins First, there is the issue of ...
This problem can be avoided by ordering sentences in the same order in which
they appear in the document, rather than by their weights.
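
Sentence extraction with document-order output can be sketched as below. The weighting used here is a deliberately crude stand-in (the average raw frequency of a sentence's words, with no stoplist or canonical forms), and the regular expressions for splitting sentences and words are assumptions; a production summarizer would use the term weights discussed earlier in the chapter.

```python
import re
from collections import Counter


def summarize(text: str, n_sentences: int) -> str:
    """Keep the n highest-weighted sentences, in their original order."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    counts = Counter(re.findall(r"[a-z']+", text.lower()))

    def weight(sentence: str) -> float:
        words = re.findall(r"[a-z']+", sentence.lower())
        return sum(counts[w] for w in words) / len(words) if words else 0.0

    # Rank by weight, then restore document order to preserve rhetorical flow.
    ranked = sorted(range(len(sentences)), key=lambda i: weight(sentences[i]), reverse=True)
    return " ".join(sentences[i] for i in sorted(ranked[:n_sentences]))


text = (
    "First, interest rates are rising. The weather was pleasant. "
    "In conclusion, rising interest rates slow growth."
)
print(summarize(text, 2))
```

Sorting the selected sentence positions before joining them is what keeps the sentence beginning with First ahead of the one beginning with In conclusion.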

Sentence segmentation drills down even further to work at the clause level.
With this technique, a sentence is divided into segments by looking for cue
phrases. Each segment conveys a single idea, such as interest rates are rising.
Segments are separated by cue phrases like because or that, as in interest
rates are rising because the Federal Reserve is concerned about inflation. The
primary advantage of sentence segmentation is that it removes clauses within
sentences that do not convey important information. Like sentence extraction,
this technique can suffer from poor rhetorical cohesion.
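
Splitting on cue phrases can be illustrated with a short regular expression. The cue list contains only the two phrases named above, and the pattern is an assumption for illustration; it will also split on uses of that which do not introduce a clause, so a real tool would need a longer cue list and some syntactic checks.

```python
import re

# Cue phrases from the example in the text; a real tool would use a longer list.
CUE_PHRASES = ("because", "that")
CUE_PATTERN = re.compile(r"\s+\b(?:" + "|".join(CUE_PHRASES) + r")\b\s+", re.IGNORECASE)


def split_segments(sentence: str) -> list[str]:
    """Split a sentence into clause-level segments at cue phrases."""
    parts = (segment.strip(" .") for segment in CUE_PATTERN.split(sentence))
    return [segment for segment in parts if segment]


segments = split_segments(
    "Interest rates are rising because the Federal Reserve is concerned about inflation."
)
print(segments)
```

Each resulting segment can then be weighted on its own, and low-weight segments dropped from the summary.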

Dealing with Large Documents 

While all three methods for summarization by extraction will produce suitable 
summaries for most texts, there are special issues that must be addressed with 
large documents. Prior to summarizing, large documents may need to be pre- 
processed in one of several ways, through: 

• Document partitioning
• Tabular data extraction
• Targeting structured elements

Document partitioning uses knowledge of document types to identify semantically
distinct sections of a document. Many business and government documents
contain both text and numeric data that is essential to understanding the
meaning of the text and needs to be addressed when summarizing. Finally,
semistructured texts can provide additional clues about important elements of a
document and may need to be extracted during document parsing.

Document Partitioning 

Large documents are usually divided into logical sections. For example, a
business plan will provide a discussion of the business organization, financial
plans, and marketing information. Project plans might describe the business
problem solved by the project, the proposed team structure, duration,
and funding. Summarizing across logical boundaries can cause problems
when the relative importance of a document section is not reflected by the
section's length.

Recall that the importance of a term is measured by its relative frequency in a
document and is inversely proportional to the term's appearance in the
larger document set; thus, the number of times a term appears in a document will
control its measure of importance. For example, the marketing section may be a
dominant section of a business plan, so terms such as market
segment appear frequently. Since this is a relatively uncommon term across all
documents, it will have a low absolute frequency and thus be considered an
important term in the document. Now a term such as long-term debt may appear
just as infrequently over the set of all documents and, thus, have a similar
absolute frequency to the term market segment. However, if the financial plan
section comprises primarily short texts and tabular numeric data, then terms
such as long-term debt will have a low relative frequency, making their
appearance in the summary less likely.

To avoid this problem, semantically independent document sections may
need to be separated into separate documents before summarizing. The results
of the individual summarization operations can then be merged to create a
semantically adequate document summary. XML is an ideal tool for document
partitioning.
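
As a sketch of partitioning with XML, the example below splits a marked-up business plan into sections, summarizes each section independently, and merges the results. The element and attribute names are invented, and the first-sentence "summarizer" is only a placeholder for a real summarization tool.

```python
import xml.etree.ElementTree as ET

# Hypothetical business-plan markup; element and attribute names are invented.
PLAN_XML = """
<plan>
  <section name="organization">The company has three divisions.</section>
  <section name="marketing">Each market segment is growing. The market segment for small firms is largest.</section>
  <section name="finance">Long-term debt is low.</section>
</plan>
"""


def partition(xml_text: str) -> dict[str, str]:
    """Split a document into independent sections keyed by section name."""
    root = ET.fromstring(xml_text.strip())
    return {sec.get("name"): (sec.text or "").strip() for sec in root.iter("section")}


def first_sentence(text: str) -> str:
    """Placeholder summarizer: keep only the first sentence."""
    return text.split(". ")[0].rstrip(".") + "."


sections = partition(PLAN_XML)
# Summarize each section on its own, then merge the partial summaries.
merged_summary = " ".join(first_sentence(body) for body in sections.values())
print(merged_summary)
```

Because each section is summarized separately, the short finance section is represented in the merged summary even though the marketing section contains more text.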

Tabular Data Extraction 

Since numeric data can often explain a fact much more concisely than a verbal
description, tables of data are often embedded within documents to help convey
a point. Summarization techniques are not designed to deal with numeric
data, so it is best to extract these tables before summarizing and merge the
extracted tables with the summarizer's output.

Table extraction can be done either with generic programs for report
reformatting (sometimes called "screen scrapers") or with custom programs. If the
tabular data is relatively well structured and consistent across a large number of
documents, then the screen-scraping approach may be the most efficient. If
tabular data will vary in location and complexity, then a scripting language with
strong support for regular expressions, such as Perl or Python, can provide the
flexibility needed to build a robust extraction routine, but at the expense of
writing and maintaining a custom program.
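
A custom extraction routine of this kind might start from a sketch like the following, written in Python. It assumes a very simple table layout, in which each row is a text label followed by at least two whitespace-separated numeric columns; the pattern and the sample report are invented for illustration and would need tuning for real documents.

```python
import re

# Assumed row layout: a text label, two or more spaces, then two or more numbers.
NUMBER = r"-?\d[\d,]*(?:\.\d+)?"
ROW_PATTERN = re.compile(
    r"^\s*(?P<label>[A-Za-z][A-Za-z \-]*?)\s{2,}"
    r"(?P<values>" + NUMBER + r"(?:\s+" + NUMBER + r")+)\s*$"
)


def extract_table(text: str) -> list[tuple[str, list[float]]]:
    """Return (label, numeric values) for every line that looks like a table row."""
    rows = []
    for line in text.splitlines():
        match = ROW_PATTERN.match(line)
        if match:
            values = [float(v.replace(",", "")) for v in match.group("values").split()]
            rows.append((match.group("label").strip(), values))
    return rows


REPORT = """\
Revenue grew in both years, as the table shows.
Revenue          1,200   1,450
Long-term debt     300     275
Management expects the trend to continue.
"""
print(extract_table(REPORT))
```

The lines that do not match can be passed on to the summarizer, and the extracted rows merged back into its output afterward.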

Targeting Structured Elements 

With the increasing popularity of XML and derived standards, we can expect
to find more and more documents in the warehouse with these structuring
elements. Document warehouse designers [...] these elements in several ways.

• [...] using a summarization by extraction program to create a summary, one
[...] from the XML document, such as an executive summary. Multiple summaries
could also be created [...] analysts could be provided with [...] of the other
major sections.

• [...] sections of the document might be summarized [...] amount of text that
must be [...].

• [...] replace the contents of some sections of the XML document [...]
semistructured document [...] and thus keep the [...]. This approach is useful
when [...].



Conclusions



Loading and transforming documents is a critical and complex operation [...]
warehouse. The steps begin with preparing for documents [...] multiple character
sets. Preprocessing steps [...] translating documents, extracting tabular [...]
summarization [...] searching. Since documents cover multiple topics [...].
Summaries are an important aid to end users because they reduce the [...]
amounts of information. Automated summarization [...] significant amounts of
text can improve the quality of the final summary. Classification and clustering
do not strictly transform documents but [...] about documents [...] users have
explicit representations of implicit relationships between documents.

[...] a number of operations are required during the load and transformation [...]